A Thank-You Letter

Dear Admission Officer,

Appreciate for your time.

This report is a brief application of a time series analysis. I learned such mathematical theory and programming all by myself without external guide, and the document is just a personal note rather than a formal paper, so please be patient if any mistake happened. Besides, my original purpose is to apply the ARIMA+GARCH model in practice (“for fun”) rather than to build a high performance investment model, thus little financial strategy is applied here.

In fact, I was a full-time professional in Ernst & Young, one of the big4 firms, as a certified public accountant. Perhaps accounting and corporate finance is the area of my expertise. Nonetheless, I expect more from myself as I decide to expand my intellectual horizon to build an inspired mind. So here I am just willing to show my self-taught knowledge to convey my efforts, aspiration, dedication and determination for your program.

It would be my honor to make every effort to realize academic value through your excellent program. Sincerely hope to see you in the near future.

Best regards,

MA Haoran

Introduction

The purpose is to build a time series model and predict the Hang Seng Index (HSI). Firstly, get the HSI index and build the training data. Then, explore the data and capture the features for a model. Find the ARIMA model with the least AIC and fit the residuals by GARCH model. Finally, make a 10 days forecast, compare the outcome with testing data.

Process Data

#loading package
library(tidyquant) 
library(fpp3)
library(tseries)
library(rugarch)
library(data.table)
library(plotly)
setwd("C:/Users/Apple/Desktop/RStudio Tour/note/TSA/Predict-HSI-by-ARIMA-GARCH")

The historical data is downloaded from Yahoo Finance([ctrl+click] to attach link) The training data covered from 2010-01-05 to 2021-01-29.

HSI <- fread("^HSI.csv")
HSI <- HSI[, .(Date, Close = as.numeric(Close))]
HSI <- HSI %>%
  tsibble(index = Date) %>%
  mutate(Return = difference(log(HSI$Close)) * 100)

HSI <- HSI %>%
  na.omit() %>%
  mutate(time = row_number()) %>%
  update_tsibble(index = time)
head(HSI)
## # A tsibble: 6 x 4 [1]
##   Date        Close Return  time
##   <date>      <dbl>  <dbl> <int>
## 1 2010-01-05 22280.  2.07      1
## 2 2010-01-06 22417.  0.613     2
## 3 2010-01-07 22269. -0.659     3
## 4 2010-01-08 22297.  0.123     4
## 5 2010-01-11 22412.  0.513     5
## 6 2010-01-12 22327. -0.379     6

Plot the Close Price

HSI %>%
  ggplot(aes(x = Date, y = Close)) +
  geom_line() +
  theme_tq() +
  labs(title = "HSI Index (date data)")

According to the plot, the Close Price is not stationary. In order to fit an ARIMA model, it could be helpful to difference the logarithm of the initial price.

Plot the return

Return = (ln(Close Price (t+1)) - ln(Close Price(t))) *100

HSI %>%
  ggplot(aes(x = Date, y = Return)) +
  geom_line() +
  theme_tq() +
  labs(title = "HSI Return (date data)")

The Return seems stationary without seasonality.

One way to determine more objectively whether differencing is required is to use a unit root test. These are statistical hypothesis tests of stationarity that are designed for determining whether differencing is required.

Unit Root Test

KPSS Test

H0:the data is stationary around a deterministic trend.

HSI%>%features(Return,unitroot_kpss)
## # A tibble: 1 x 2
##   kpss_stat kpss_pvalue
##       <dbl>       <dbl>
## 1    0.0308         0.1

KPSS test Pvalue = 0.1 > 0.05, so the null hypothesis is accepted.

ADF Test

H0: a unit root is present.

H1: time series is stationary.

adf.test(HSI$Return, alternative = "stationary")
## 
##  Augmented Dickey-Fuller Test
## 
## data:  HSI$Return
## Dickey-Fuller = -13.999, Lag order = 13, p-value = 0.01
## alternative hypothesis: stationary

ADF test Pvalue = 0.01 < 0.05, so the null hypothesis is rejected.

In conclusion, the time series is stationary for an ARIMA model.

Model:ARIMA+GARCH

ARIMA part

Find the ARIMA model with the least AIC.

arima.mod <- HSI %>%
    model(ARIMA(Return, stepwise = FALSE, approximation = FALSE))

arima.mod
## # A mable: 1 x 1
##   `ARIMA(Return, stepwise = FALSE, approximation = FALSE)`
##                                                    <model>
## 1                                           <ARIMA(2,0,3)>

ARIMA residual analysis:

arima.mod%>%
    gg_tsresiduals()

Ljung-box Test

H0: Data cannot be distinguished from white noise.

augment(arima.mod)%>%features(.innov,ljung_box)
## # A tibble: 1 x 3
##   .model                                                 lb_stat lb_pvalue
##   <chr>                                                    <dbl>     <dbl>
## 1 ARIMA(Return, stepwise = FALSE, approximation = FALSE)  0.0101     0.920

According to the plot and ljung-box test, pvalue>0.05, so the residuals of the model cannot be distinguished from white noise. It is rational to accept ARIMA(2,0,3).

Detect the ARCH effect on residuals:

There would be an ARCH effect if the acf and pacf of residuals^2 are significant.

p1 <- arima.mod %>%
  augment() %>%
  select(.innov) %>%
  ACF(.innov ^ 2) %>%
  autoplot() +
  labs(y = "resid^2  acf")

p2 <- arima.mod %>%
  augment() %>%
  select(.innov) %>%
  PACF(.innov ^ 2) %>%
  autoplot() +
  labs(y = "resid^2  pacf")

gridExtra::grid.arrange(p1, p2)

Thus the ARCH effect on residuals is significant, it plausible to fit a GARCH model on residuals.

GARCH part

Try model: ARIMA(2,0,3)+GARCH(1,1)

spec <-
  ugarchspec(
    variance.model = list(
      model = "sGARCH",
      garchOrder = c(1, 1),
      submodel = NULL,
      external.regressors = NULL,
      variance.targeting = FALSE
    ),
    mean.model = list(armaOrder = c(2, 3),
                      include.mean = TRUE),
    distribution.model = "sged"
  )


fit <- ugarchfit(spec, HSI$Return,
                 solver = "hybrid")

Report the final model

print(fit)
## 
## *---------------------------------*
## *          GARCH Model Fit        *
## *---------------------------------*
## 
## Conditional Variance Dynamics    
## -----------------------------------
## GARCH Model  : sGARCH(1,1)
## Mean Model   : ARFIMA(2,0,3)
## Distribution : sged 
## 
## Optimal Parameters
## ------------------------------------
##         Estimate  Std. Error   t value Pr(>|t|)
## mu      0.025299    0.018994   1.33197 0.182871
## ar1    -1.312107    0.015853 -82.76469 0.000000
## ar2    -0.440748    0.020674 -21.31875 0.000000
## ma1     1.311944    0.002183 601.05156 0.000000
## ma2     0.429959    0.029283  14.68293 0.000000
## ma3     0.006806    0.011885   0.57264 0.566887
## omega   0.020438    0.007478   2.73295 0.006277
## alpha1  0.051404    0.009379   5.48086 0.000000
## beta1   0.933010    0.012840  72.66518 0.000000
## skew    0.916142    0.021129  43.36049 0.000000
## shape   1.383882    0.053582  25.82747 0.000000
## 
## Robust Standard Errors:
##         Estimate  Std. Error    t value Pr(>|t|)
## mu      0.025299    0.019449    1.30079 0.193331
## ar1    -1.312107    0.012415 -105.69088 0.000000
## ar2    -0.440748    0.016284  -27.06573 0.000000
## ma1     1.311944    0.002901  452.27417 0.000000
## ma2     0.429959    0.016539   25.99719 0.000000
## ma3     0.006806    0.006901    0.98623 0.324022
## omega   0.020438    0.008056    2.53687 0.011185
## alpha1  0.051404    0.011494    4.47233 0.000008
## beta1   0.933010    0.014748   63.26441 0.000000
## skew    0.916142    0.022258   41.16010 0.000000
## shape   1.383882    0.055471   24.94806 0.000000
## 
## LogLikelihood : -4046.441 
## 
## Information Criteria
## ------------------------------------
##                    
## Akaike       2.9933
## Bayes        3.0173
## Shibata      2.9933
## Hannan-Quinn 3.0020
## 
## Weighted Ljung-Box Test on Standardized Residuals
## ------------------------------------
##                          statistic p-value
## Lag[1]                       3.696 0.05453
## Lag[2*(p+q)+(p+q)-1][14]     8.013 0.19343
## Lag[4*(p+q)+(p+q)-1][24]    11.670 0.59853
## d.o.f=5
## H0 : No serial correlation
## 
## Weighted Ljung-Box Test on Standardized Squared Residuals
## ------------------------------------
##                         statistic p-value
## Lag[1]                      2.893 0.08898
## Lag[2*(p+q)+(p+q)-1][5]     6.596 0.06492
## Lag[4*(p+q)+(p+q)-1][9]     8.511 0.10215
## d.o.f=2
## 
## Weighted ARCH LM Tests
## ------------------------------------
##             Statistic Shape Scale P-Value
## ARCH Lag[3]     1.928 0.500 2.000  0.1650
## ARCH Lag[5]     2.589 1.440 1.667  0.3551
## ARCH Lag[7]     3.664 2.315 1.543  0.3976
## 
## Nyblom stability test
## ------------------------------------
## Joint Statistic:  1.5655
## Individual Statistics:              
## mu     0.10650
## ar1    0.06487
## ar2    0.06350
## ma1    0.08800
## ma2    0.08856
## ma3    0.12493
## omega  0.09098
## alpha1 0.07925
## beta1  0.07443
## skew   0.18009
## shape  0.27474
## 
## Asymptotic Critical Values (10% 5% 1%)
## Joint Statistic:          2.49 2.75 3.27
## Individual Statistic:     0.35 0.47 0.75
## 
## Sign Bias Test
## ------------------------------------
##                    t-value    prob sig
## Sign Bias           0.6568 0.51138    
## Negative Sign Bias  0.2529 0.80038    
## Positive Sign Bias  2.0783 0.03778  **
## Joint Effect       10.3007 0.01618  **
## 
## 
## Adjusted Pearson Goodness-of-Fit Test:
## ------------------------------------
##   group statistic p-value(g-1)
## 1    20     20.06       0.3911
## 2    30     32.83       0.2845
## 3    40     38.35       0.4995
## 4    50     56.52       0.2146
## 
## 
## Elapsed time : 4.88823

Forecast

Build a forecast function to predict return in 10 days.

fore <- function(input, p, q, n) {
  spec <-
    ugarchspec(
      variance.model = list(
        model = "sGARCH",
        garchOrder = c(1, 1),
        submodel = NULL,
        external.regressors = NULL,
        variance.targeting = FALSE
      ),
      mean.model = list(armaOrder = c(p, q),
                        include.mean = TRUE),
      distribution.model = "sged"
    )
  fit <- ugarchfit(spec, input$Return,
                   solver = "hybrid")
  fore <- ugarchforecast(fit, n.ahead = n)
  pred <- fore@forecast$seriesFor
  pred <- as.data.table(pred)
  setnames(pred, "Return")
  pred[, time := seq.int(from = 1, to = n, by = 1)][]
  
}


pred <- fore(HSI, 2, 3, 10)

#View the predictions
pred
##           Return time
##  1:  0.154925180    1
##  2: -0.109081450    2
##  3:  0.138608698    3
##  4: -0.064147019    4
##  5:  0.092721230    5
##  6: -0.023742514    6
##  7:  0.059931009    7
##  8:  0.001473559    8
##  9:  0.041297048    9
## 10:  0.014809375   10

Examine the predictions with testing set. The raw test data starts from 2021-01-29, the last day of the traing set.

test <- fread("test.csv")
test <- test[, .(Date, Close = as.numeric(Close))]
test <- test %>%
  tsibble(index = Date) %>%
  mutate(Return = difference(log(test$Close)) * 100)

test <- test %>%
  na.omit() %>%
  mutate(time = row_number()) %>%
  update_tsibble(index = time)

# Preview the test data
head(test, 5)
## # A tsibble: 5 x 4 [1]
##   Date        Close Return  time
##   <date>      <dbl>  <dbl> <int>
## 1 2021-02-01 28893.  2.13      1
## 2 2021-02-02 29249.  1.22      2
## 3 2021-02-03 29307.  0.201     3
## 4 2021-02-04 29114. -0.664     4
## 5 2021-02-05 29289.  0.600     5

Plot forecast value and test value.

foretest <- function(pred, test) {
  p <- pred %>%
    tsibble(index = time) %>%
    left_join(test, by = "time") %>%
    mutate(Pred = Return.x,
           Actual = Return.y) %>%
    select(time, Date, Close, Pred, Actual) %>%
    pivot_longer(
      cols = c(Pred, Actual),
      values_to = "Return",
      names_to = "Type"
    ) %>%
    ggplot(aes(x = Date, y = Return, color = Type)) +
    geom_line() +
    scale_x_date(date_minor_breaks = "1 day") +
    ggsci::scale_color_aaas()+
    labs(title = "10 Trading Days Forecast")
  ggplotly(p)
}

foretest(pred, test)

Discrepancy could easily be observed, the actual curve seems 1 lag behind the prediction curve. However, the model is still capable of capturing some trend since the locations of inflection points are pretty close. Also, as the forecast period grows, the prediction curve gradually smooths.

(Perhaps the deviation is caused by efficient market, which maybe a good news for society. Just kidding!)

Of course the model has to be improved. Cross-validation would be useful. It might be fun to use bootstrap and function hilo() to draw the prediction interval.

Cross Validation

From year 2020 to 2021, splice every 100 trading days and train data and make a one day forecast.

For instance,

train Day1~Day100, forecast Day101; train Day2~Day101, forecast Day102; train Day3~Day103, forecast Day103,etc.

Get the ARMA p,q with the least AIC for “year > 2020”

CrossValidation <- HSI %>%
  filter(year(Date) >= 2020)

buildARIMA <- function(input) {
    arima.mod <- input %>%
        model(ARIMA(Return, stepwise = FALSE, approximation = FALSE))
    
    list(
        model = arima.mod,
        ljung_box = augment(arima.mod) %>% features(.innov, ljung_box)
    )
}

buildARIMA(CrossValidation)
## $model
## # A mable: 1 x 1
##   `ARIMA(Return, stepwise = FALSE, approximation = FALSE)`
##                                                    <model>
## 1                                           <ARIMA(3,0,1)>
## 
## $ljung_box
## # A tibble: 1 x 3
##   .model                                                  lb_stat lb_pvalue
##   <chr>                                                     <dbl>     <dbl>
## 1 ARIMA(Return, stepwise = FALSE, approximation = FALSE) 0.000959     0.975

According to the result, ARMA(3,1) is acceptable.

Cross Validation Test

ARIMA(3,0,1)+GARCH(1,1)

CrossValidation <- CrossValidation %>%
  tsibble(index = time) %>%
  stretch_tsibble(.step = 1, .init = 100, .id = ".id") %>%
  group_by_key() %>%
  slice(n() - 99:0)
 spec <-
            ugarchspec(
                variance.model = list(
                    model = "sGARCH",
                    garchOrder = c(1, 1),
                    submodel = NULL,
                    external.regressors = NULL,
                    variance.targeting = FALSE
                ),
                mean.model = list(armaOrder = c(3, 1),
                                  include.mean = TRUE),
                distribution.model = "sged"
            )

 
 
 
 Crossforecast <- function(data) {
   CrossFore <- vector()
   for (i in 1:tail(data$.id)[1]) {
     train <- data %>%
       filter(.id == i)
     
     fit <- ugarchfit(spec, train$Return,
                      solver = "hybrid")
     
     CrossFore[i] <-
       ugarchforecast(fit, n.ahead = 1)@forecast$seriesFor[1]
     
   }
   CrossFore
 }

 CrossPred<-Crossforecast(CrossValidation)
 
 tsCV<-HSI %>%
    filter(year(Date) >= 2020)%>%
    mutate(time=row_number())%>%
    filter(time>100)%>%
   mutate(Pred=CrossPred[1:(length(CrossPred)-1)])
 head(tsCV)
## # A tsibble: 6 x 5 [1]
##   Date        Close Return  time    Pred
##   <date>      <dbl>  <dbl> <int>   <dbl>
## 1 2020-05-29 22961. -0.743   101 -0.490 
## 2 2020-06-01 23733.  3.30    102  0.0586
## 3 2020-06-02 23996.  1.10    103 -0.957 
## 4 2020-06-03 24326.  1.36    104 -0.744 
## 5 2020-06-04 24366.  0.167   105 -0.329 
## 6 2020-06-05 24770.  1.64    106 -0.532

Plot the result

tsCV %>%
  pivot_longer(
    cols = c(Pred, Return),
    values_to = "Return",
    names_to = "Type"
  ) %>%
  ggplot(aes(x = Date, y = Return, color = Type)) +
  geom_line() +
  scale_x_date(date_minor_breaks = "1 day") +
  theme_tq() +
  labs(title = "Cross Validation Test")

Weakness:

The fluctuation of the forecast curve is significantly weaker than the actual curve.

Advantage:

The trend of the prediction curve is informative. Also, it is harmless to determine whether return is positive or negative by forecast value after June 2020.

To assure the reproducibility, all code and data have been uploaded to Github([ctrl+click] to attach link)